Probabilistic information retrieval in a distributed heterogeneous environment

Author

  • Christoph Baumgarten
Abstract

This thesis describes a probabilistic model for optimum information retrieval in a distributed heterogeneous environment. The model assumes the collection of documents offered by the environment to be hierarchically partitioned into subcollections. Documents as well as subcollections have to be indexed, and indexing methods using different indexing vocabularies can be employed for this. A query provided by a user is answered in terms of a ranked list of documents. The model determines a procedure for ranking the documents that stems from the Probability Ranking Principle: for each subcollection, the subcollection's elements are ranked; the resulting ranked lists are combined into a final ranked list of documents, where the ordering is determined by the documents' probabilities of being relevant with respect to the user's query. Various probabilistic ranking methods may be involved in the distributed ranking process. The underlying data volume is arbitrarily scalable. A criterion for effectively limiting the ranking process to a subset of subcollections extends the model. The model's applicability is experimentally confirmed. When the degrees of freedom provided by the model were exploited, experiments showed evidence that the model even outperforms comparable models for the non-distributed case with respect to retrieval effectiveness. An architecture for a distributed information retrieval system that realizes the probabilistic model is presented. The system provides access to an arbitrary number of dynamic multimedia databases.
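The combination step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual procedure, which the abstract does not spell out. It assumes that each subcollection already returns its documents sorted by an estimated probability of relevance and that these estimates are comparable across subcollections. The function name `merge_ranked_lists`, the document ids, and the probability values are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

# (document id, estimated probability of relevance); both are illustrative.
Ranked = List[Tuple[str, float]]


def merge_ranked_lists(ranked_lists: Iterable[Ranked]) -> Ranked:
    """Merge per-subcollection rankings into one list, highest probability first.

    Assumes every input list is already sorted by descending probability
    and that probabilities are comparable across subcollections.
    """
    # With reverse=True, heapq.merge expects inputs sorted largest to smallest
    # and yields the merged items in that same descending order.
    return list(heapq.merge(*ranked_lists, key=lambda item: item[1], reverse=True))


# Hypothetical rankings from two subcollections.
sub_a = [("a1", 0.9), ("a2", 0.4)]
sub_b = [("b1", 0.7), ("b2", 0.6), ("b3", 0.1)]

final_ranking = merge_ranked_lists([sub_a, sub_b])
for doc_id, prob in final_ranking:
    print(doc_id, prob)
```

A lazy merge of this kind only consumes as much of each input list as the final ranking needs, which is one plausible reason to prefer it over concatenating and re-sorting when many subcollections are involved.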

Similar resources

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

Testing the cluster hypothesis in distributed information retrieval

How to merge and organise query results retrieved from different resources is one of the key issues in distributed information retrieval. Some previous research and experiments suggest that cluster-based document browsing is more effective than a single merged list. Cluster-based retrieval results presentation is based on the cluster hypothesis, which states that documents that cluster together...

A Conceptual Level Design Methodology for Probabilistic Relational Databases

When multiple heterogeneous databases show different values for the same data item, its actual value is not known with certainty. To develop corporate data warehouses, which consolidate data from multiple heterogeneous data sources, has become an important issue for designing modern business information systems. Probabilistic relational databases have extended from the relational database model...

Distributed ontology based information retrieval using semantic web

In recent years, user demand for integrated searches over several independently operating semantic web systems has been increasing rapidly. Integrated semantic searches enable more meaningful results to be generated because information with similar meanings in diverse areas and domains is likely to be used for inference. However, it is not easy to integrate physically independent, distributed,...

Publication date: 1999